This workbook was created using the “dataexpks” template:

https://github.com/DublinLearningGroup/dataexpks

1 Introduction

This workbook performs the basic data exploration of the dataset.

level_exclusion_threshold <- 100

cat_level_count <- 40
hist_bins_count <- 50

2 Load Data

First we load the dataset.

modeldata_tbl <- read_rds("data/modelling1_data_tbl.rds")

rawdata_tbl <- modeldata_tbl %>% select(-sev_data)

rawdata_tbl %>% glimpse()
## Rows: 413,169
## Columns: 14
## $ policy_id      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11…
## $ exposure       <dbl> 0.09, 0.84, 0.52, 0.45, 0.15, 0.75, 0.81, 0.05, 0.76, …
## $ power          <fct> g, g, f, f, g, g, d, d, d, i, f, f, e, e, e, e, e, e, …
## $ car_age        <int> 0, 0, 2, 2, 0, 0, 1, 0, 9, 0, 2, 2, 0, 0, 0, 0, 0, 0, …
## $ driver_age     <int> 46, 46, 38, 38, 41, 41, 27, 27, 23, 44, 32, 32, 33, 33…
## $ brand          <fct> Japanese (except Nissan) or Korean, Japanese (except N…
## $ gas            <fct> Diesel, Diesel, Regular, Regular, Diesel, Diesel, Regu…
## $ region         <fct> Aquitaine, Aquitaine, Nord-Pas-de-Calais, Nord-Pas-de-…
## $ density        <int> 76, 76, 3003, 3003, 60, 60, 695, 695, 7887, 27000, 23,…
## $ claim_total    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ claim_count    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ cat_driver_age <fct> 43-74, 43-74, 27-42, 27-42, 27-42, 27-42, 27-42, 27-42…
## $ cat_car_age    <fct> 0-1, 0-1, 2-4, 2-4, 0-1, 0-1, 0-1, 0-1, 5-15, 0-1, 2-4…
## $ cat_density    <fct> 41-200, 41-200, 501-4500, 501-4500, 41-200, 41-200, 50…

2.1 Perform Quick Data Cleaning

cleaned_names <- rawdata_tbl %>% names()

data_tbl <- rawdata_tbl %>% set_colnames(cleaned_names)

data_tbl %>% glimpse()
## Rows: 413,169
## Columns: 14
## $ policy_id      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11…
## $ exposure       <dbl> 0.09, 0.84, 0.52, 0.45, 0.15, 0.75, 0.81, 0.05, 0.76, …
## $ power          <fct> g, g, f, f, g, g, d, d, d, i, f, f, e, e, e, e, e, e, …
## $ car_age        <int> 0, 0, 2, 2, 0, 0, 1, 0, 9, 0, 2, 2, 0, 0, 0, 0, 0, 0, …
## $ driver_age     <int> 46, 46, 38, 38, 41, 41, 27, 27, 23, 44, 32, 32, 33, 33…
## $ brand          <fct> Japanese (except Nissan) or Korean, Japanese (except N…
## $ gas            <fct> Diesel, Diesel, Regular, Regular, Diesel, Diesel, Regu…
## $ region         <fct> Aquitaine, Aquitaine, Nord-Pas-de-Calais, Nord-Pas-de-…
## $ density        <int> 76, 76, 3003, 3003, 60, 60, 695, 695, 7887, 27000, 23,…
## $ claim_total    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ claim_count    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ cat_driver_age <fct> 43-74, 43-74, 27-42, 27-42, 27-42, 27-42, 27-42, 27-42…
## $ cat_car_age    <fct> 0-1, 0-1, 2-4, 2-4, 0-1, 0-1, 0-1, 0-1, 5-15, 0-1, 2-4…
## $ cat_density    <fct> 41-200, 41-200, 501-4500, 501-4500, 41-200, 41-200, 50…

2.2 Create Derived Variables

We now create derived features useful for modelling. These are new variables calculated from existing columns in the data.

## Rows: 413,169
## Columns: 14
## $ policy_id      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11…
## $ exposure       <dbl> 0.09, 0.84, 0.52, 0.45, 0.15, 0.75, 0.81, 0.05, 0.76, …
## $ power          <fct> g, g, f, f, g, g, d, d, d, i, f, f, e, e, e, e, e, e, …
## $ car_age        <int> 0, 0, 2, 2, 0, 0, 1, 0, 9, 0, 2, 2, 0, 0, 0, 0, 0, 0, …
## $ driver_age     <int> 46, 46, 38, 38, 41, 41, 27, 27, 23, 44, 32, 32, 33, 33…
## $ brand          <fct> Japanese (except Nissan) or Korean, Japanese (except N…
## $ gas            <fct> Diesel, Diesel, Regular, Regular, Diesel, Diesel, Regu…
## $ region         <fct> Aquitaine, Aquitaine, Nord-Pas-de-Calais, Nord-Pas-de-…
## $ density        <int> 76, 76, 3003, 3003, 60, 60, 695, 695, 7887, 27000, 23,…
## $ claim_total    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ claim_count    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ cat_driver_age <fct> 43-74, 43-74, 27-42, 27-42, 27-42, 27-42, 27-42, 27-42…
## $ cat_car_age    <fct> 0-1, 0-1, 2-4, 2-4, 0-1, 0-1, 0-1, 0-1, 5-15, 0-1, 2-4…
## $ cat_density    <fct> 41-200, 41-200, 501-4500, 501-4500, 41-200, 41-200, 50…
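The derivation chunk itself is not echoed above. As an illustrative sketch, banded variables such as `cat_driver_age` can be built with `cut()`; the break points below are inferred from the factor levels visible in the glimpse output and are an assumption, not the template's actual code.

```r
library(dplyr)

# Hypothetical sketch: derive a banded driver-age factor with cut().
# Break points are guessed from the levels seen above ("27-42", "43-74");
# they are assumptions, not the values actually used by the template.
sample_tbl <- tibble(driver_age = c(25L, 30L, 46L, 80L))

derived_tbl <- sample_tbl %>%
  mutate(
    cat_driver_age = cut(
      driver_age,
      breaks = c(17, 26, 42, 74, Inf),
      labels = c("18-26", "27-42", "43-74", "75+")
    )
  )
```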

2.3 Check Missing Values

Before we do anything with the data, we first check for missing values in the dataset. In some cases, missing data is coded by a special character rather than as a blank, so we first correct for this.

### _TEMPLATE_
### ADD CODE TO CORRECT FOR DATA ENCODING HERE
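As a hedged sketch of what such a correction might look like: the sentinel values `"unknown"` and `-1` below are hypothetical (this dataset may not use any), but the `na_if()` pattern is a common way to recode them to proper `NA`s.

```r
library(dplyr)

# Hypothetical example: recode sentinel missing-value codes to NA.
# The sentinels "unknown" and -1 are assumptions for illustration only.
fix_encoding <- function(tbl) {
  tbl %>%
    mutate(
      across(where(is.character), ~ na_if(.x, "unknown")),
      across(where(is.numeric),   ~ na_if(.x, -1))
    )
}

demo_tbl <- tibble(policy_id = c("1", "unknown"), car_age = c(3, -1))
fixed_tbl <- fix_encoding(demo_tbl)
```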

With missing data properly encoded, we now visualise the missing data in a number of different ways.

2.3.1 Univariate Missing Data

We first examine a simple univariate count of all the missing data:

row_count <- data_tbl %>% nrow()

missing_univariate_tbl <- data_tbl %>%
  summarise_all(list(~sum(are_na(.)))) %>%
  gather("variable", "missing_count") %>%
  mutate(missing_prop = missing_count / row_count)

ggplot(missing_univariate_tbl) +
  geom_bar(aes(x = fct_reorder(variable, -missing_prop),
               weight = missing_prop)) +
  xlab("Variable") +
  ylab("Missing Value Proportion") +
  theme(axis.text.x = element_text(angle = 90))

We remove all variables where every entry is missing.

remove_vars <- missing_univariate_tbl %>%
  filter(missing_count == row_count) %>%
  pull(variable)

lessmiss_data_tbl <- data_tbl %>%
  select(-one_of(remove_vars))

With these columns removed, we repeat the exercise.

missing_univariate_tbl <- lessmiss_data_tbl %>%
  summarise_all(list(~sum(are_na(.)))) %>%
  gather("variable", "missing_count") %>%
  mutate(missing_prop = missing_count / row_count)

ggplot(missing_univariate_tbl) +
  geom_bar(aes(x = fct_reorder(variable, -missing_prop),
               weight = missing_prop)) +
  xlab("Variable") +
  ylab("Missing Value Proportion") +
  theme(axis.text.x = element_text(angle = 90))

To reduce the scale of this plot, we look at the variables with the fifty largest missing-value counts.

missing_univariate_top_tbl <- missing_univariate_tbl %>%
  arrange(desc(missing_count)) %>%
  top_n(n = 50, wt = missing_count)

ggplot(missing_univariate_top_tbl) +
  geom_bar(aes(x = fct_reorder(variable, -missing_prop),
               weight = missing_prop)) +
  xlab("Variable") +
  ylab("Missing Value Proportion") +
  theme(axis.text.x = element_text(angle = 90))

2.3.2 Multivariate Missing Data

It is useful to get an idea of which combinations of variables tend to have missing values simultaneously. To visualise this, we count the occurrences of each combination of missing variables and display these counts as a heat map.

row_count <- rawdata_tbl %>% nrow()

count_nas <- ~ .x %>% are_na() %>% vec_cast(integer())

missing_plot_tbl <- rawdata_tbl %>%
  mutate_all(count_nas) %>%
  mutate(label = pmap_chr(., str_c)) %>%
  group_by(label) %>%
  summarise_all(list(sum)) %>%
  arrange(desc(label)) %>%
  select(-label) %>%
  mutate(label_count = pmap_int(., pmax)) %>%
  gather("col", "count", -label_count) %>%
  mutate(miss_prop   = count / row_count,
         group_label = sprintf("%6.4f", round(label_count / row_count, 4))
        )

ggplot(missing_plot_tbl) +
  geom_tile(aes(x = col, y = group_label, fill = miss_prop), height = 0.8) +
  scale_fill_continuous() +
  scale_x_discrete(position = "top") +
  xlab("Variable") +
  ylab("Missing Value Proportion") +
  theme(axis.text.x = element_text(angle = 90))

This visualisation takes a little explaining.

Each row represents a combination of variables with simultaneous missing values. For each row in the graphic, the coloured entries show which particular variables are missing in that combination. The proportion of rows with that combination is displayed in both the label for the row and the colouring for the cells in the row.

2.4 Inspect High-level-count Categorical Variables

With the raw data loaded up we now remove obvious unique or near-unique variables that are not amenable to basic exploration and plotting.

coltype_lst <- create_coltype_list(data_tbl)

count_levels <- ~ .x %>% unique() %>% length()

catvar_valuecount_tbl <- data_tbl %>%
  summarise_at(coltype_lst$split$discrete, count_levels) %>%
  gather("var_name", "level_count") %>%
  arrange(-level_count)

print(catvar_valuecount_tbl)
## # A tibble: 8 x 2
##   var_name       level_count
##   <chr>                <int>
## 1 policy_id           413169
## 2 power                   12
## 3 region                  10
## 4 brand                    7
## 5 cat_driver_age           5
## 6 cat_density              5
## 7 cat_car_age              4
## 8 gas                      2
row_count <- nrow(data_tbl)

cat(str_c("Dataset has ", row_count, " rows\n"))
## Dataset has 413169 rows

Now that we have a table of the level counts of all the categorical variables, we can automatically exclude unique variables from the exploration, as their level count will match the row count.

unique_vars <- catvar_valuecount_tbl %>%
  filter(level_count == row_count) %>%
  pull(var_name)

print(unique_vars)
## [1] "policy_id"
explore_data_tbl <- data_tbl %>%
  select(-one_of(unique_vars))

Having removed the unique identifier variables from the dataset, we may also wish to exclude categorical variables with high level counts, so we create a vector of those variable names.

highcount_vars <- catvar_valuecount_tbl %>%
  filter(level_count >= level_exclusion_threshold,
         level_count < row_count) %>%
  pull(var_name)

cat(str_c(highcount_vars, collapse = ", "))

We can now continue with some basic exploration of the data. We may also choose to remove some extra columns from the dataset.

### You may want to comment out these next few lines to customise which
### categoricals are kept in the exploration.
drop_vars <- c(highcount_vars)

if (length(drop_vars) > 0) {
  explore_data_tbl <- explore_data_tbl %>%
      select(-one_of(drop_vars))

  cat(str_c(drop_vars, collapse = ", "))
}

3 Univariate Data Exploration

Now that we have loaded the data we can prepare it for some basic data exploration. We first exclude the variables that are unique identifiers or similar, and then split the remaining variables into various categories to help with the systematic data exploration.

coltype_lst <- create_coltype_list(explore_data_tbl)

print(coltype_lst)
## $split
## $split$continuous
## [1] "exposure"    "car_age"     "driver_age"  "density"     "claim_total"
## [6] "claim_count"
## 
## $split$discrete
## [1] "power"          "brand"          "gas"            "region"        
## [5] "cat_driver_age" "cat_car_age"    "cat_density"   
## 
## 
## $columns
##       exposure          power        car_age     driver_age          brand 
##   "continuous"     "discrete"   "continuous"   "continuous"     "discrete" 
##            gas         region        density    claim_total    claim_count 
##     "discrete"     "discrete"   "continuous"   "continuous"   "continuous" 
## cat_driver_age    cat_car_age    cat_density 
##     "discrete"     "discrete"     "discrete"

3.1 Logical Variables

Logical variables only take two values: TRUE or FALSE. It is useful to see missing data as well though, so we also plot the count of those.

logical_vars <- coltype_lst$split$logical %>% sort()

for (plot_varname in logical_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()

  explore_plot <- ggplot(explore_data_tbl) +
    geom_bar(aes(x = !! sym(plot_varname))) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
                  " (", na_count, " missing values)")) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

  plot(explore_plot)
}

3.2 Numeric Variables

Numeric variables are usually continuous in nature, though we also have integer data.

numeric_vars <- coltype_lst$split$continuous %>% sort()

for (plot_varname in numeric_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_var <- explore_data_tbl %>% pull(!! plot_varname)
  na_count <- plot_var %>% are_na() %>% sum()

  plot_var %>% summary() %>% print()

  explore_plot <- ggplot(explore_data_tbl) +
    geom_histogram(aes(x = !! sym(plot_varname)),
                   bins = hist_bins_count) +
    geom_vline(xintercept = mean(plot_var, na.rm = TRUE),
               colour = "red",   size = 1.5) +
    geom_vline(xintercept = median(plot_var, na.rm = TRUE),
               colour = "green", size = 1.5) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c("Histogram Plot for Variable: ", plot_varname,
                  " (", na_count, " missing values)"),
            subtitle = "(red line is mean, green line is median)")

  explore_std_plot <- explore_plot + scale_x_continuous(labels = label_comma())
  explore_log_plot <- explore_plot + scale_x_log10     (labels = label_comma())

  plot_grid(explore_std_plot,
            explore_log_plot, nrow = 2) %>% print()
}
## --
## car_age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   7.532  12.000 100.000

## --
## claim_count
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03916 0.00000 4.00000

## --
## claim_total
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0       0.0       0.0      83.4       0.0 2036833.0

## --
## density
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      67     287    1985    1410   27000

## --
## driver_age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   34.00   44.00   45.32   54.00   99.00

## --
## exposure
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002732 0.200000 0.540000 0.561088 1.000000 1.990000

3.3 Categorical Variables

Categorical variables take values from a limited, and usually fixed, set of possible values.

categorical_vars <- coltype_lst$split$discrete %>% sort()

for (plot_varname in categorical_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()

  plot_tbl <- explore_data_tbl %>%
    pull(!! plot_varname) %>%
    fct_lump(n = cat_level_count) %>%
    fct_count() %>%
    mutate(f = fct_relabel(f, str_trunc, width = 15))

  explore_plot <- ggplot(plot_tbl) +
    geom_bar(aes(x = fct_reorder(f, -n), weight = n)) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
                  " (", na_count, " missing values)")) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

  plot(explore_plot)
}
## --
## brand

## --
## cat_car_age

## --
## cat_density

## --
## cat_driver_age

## --
## gas

## --
## power

## --
## region

3.4 Date/Time Variables

Date/Time variables represent calendar or time-based data such as the time of day, a date, or a timestamp.

datetime_vars <- coltype_lst$split$datetime %>% sort()

for (plot_varname in datetime_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_var <- explore_data_tbl %>% pull(!! plot_varname)
  na_count <- plot_var %>% are_na() %>% sum()

  plot_var %>% summary() %>% print()

  explore_plot <- ggplot(explore_data_tbl) +
    geom_histogram(aes(x = !! sym(plot_varname)),
                   bins = hist_bins_count) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c("Barplot of Dates/Times in Variable: ", plot_varname,
                  " (", na_count, " missing values)"))

  plot(explore_plot)
}

4 Bivariate Data Exploration

We now move on to looking at bivariate plots of the data set.

4.1 Facet Plots on Variables

A natural way to explore relationships in data is to create univariate visualisations facetted by a categorical value.

facet_varname <- "region"

facet_count_max <- 3

4.1.1 Logical Variables

For logical variables we facet on barplots of the levels, comparing TRUE, FALSE and missing data.

logical_vars <- logical_vars[!logical_vars %in% facet_varname] %>% sort()


for (plot_varname in logical_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_tbl <- data_tbl %>% filter(!are_na(!! sym(plot_varname)))

  explore_plot <- ggplot(plot_tbl) +
    geom_bar(aes(x = !! sym(plot_varname))) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c(facet_varname, "-Faceted Barplots for Variable: ",
                  plot_varname)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

  plot(explore_plot)
}

4.1.2 Numeric Variables

For numeric variables, we facet on histograms of the data.

for (plot_varname in numeric_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_tbl <- data_tbl %>% filter(!are_na(!! sym(plot_varname)))

  explore_plot <- ggplot(plot_tbl) +
    geom_histogram(aes(x = !! sym(plot_varname)),
                   bins = hist_bins_count) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
                  plot_varname)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

  print(explore_plot + scale_x_continuous(labels = label_comma()))
  print(explore_plot + scale_x_log10     (labels = label_comma()))
}
## --
## car_age

## --
## claim_count

## --
## claim_total

## --
## density

## --
## driver_age

## --
## exposure

4.1.3 Categorical Variables

We treat categorical variables like logical variables, faceting the barplots of the different levels of the data.

categorical_vars <- categorical_vars[!categorical_vars %in% facet_varname] %>% sort()

for (plot_varname in categorical_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_tbl <- data_tbl %>%
    filter(!are_na(!! sym(plot_varname))) %>%
    mutate(
      varname_trunc = fct_relabel(!! sym(plot_varname), str_trunc, width = 10)
      )

  explore_plot <- ggplot(plot_tbl) +
    geom_bar(aes(x = varname_trunc)) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
                  plot_varname)) +
    theme(axis.text.x = element_text(angle = 90))

  plot(explore_plot)
}
## --
## brand

## --
## cat_car_age

## --
## cat_density

## --
## cat_driver_age

## --
## gas

## --
## power

4.1.4 Date/Time Variables

Like the univariate plots, we facet on histograms of the years in the dates.

for (plot_varname in datetime_vars) {
  cat("--\n")
  cat(str_c(plot_varname, "\n"))

  plot_tbl <- data_tbl %>% filter(!are_na(!! sym(plot_varname)))

  explore_plot <- ggplot(plot_tbl) +
    geom_histogram(aes(x = !! sym(plot_varname)),
                   bins = hist_bins_count) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
                  plot_varname))

  plot(explore_plot)
}

5 Frequency Exploration

In this section you can add your own multivariate visualisations, such as boxplots and so on.

5.1 Explore Claim Rates

The first quantity we wish to focus on is the claim rate, that is, the expected number of claims per year of exposure:

\[ \text{Claim Rate} = \frac{\text{Claim Count}}{\text{Time of Exposure}} \]

We can calculate this on an aggregated basis having split the data down various dimensions. To reduce the amount of copy/paste of code, we construct a quick function to produce this.

calculate_claim_rate <- function(data_tbl, ...) {
  group_vars <- enquos(...)
  
  claimrate_tbl <- data_tbl %>%
    group_by(!!! group_vars) %>%
    summarise(
      .groups = "drop",
      total_claimcount = sum(claim_count),
      total_exposure   = sum(exposure),
      claim_rate       = total_claimcount / total_exposure
    )
  
  return(claimrate_tbl)
}

5.1.1 Calculate Basic Claim Rates

First we look at the overall claim rate for the whole book of business:

data_tbl %>% calculate_claim_rate()
## # A tibble: 1 x 3
##   total_claimcount total_exposure claim_rate
##              <int>          <dbl>      <dbl>
## 1            16181        231824.     0.0698

We now want to start looking at the claim rate by different dimensions.

construct_claimrate_data_plots <- function(label, varname, data_tbl) {
  segdata_tbl <- data_tbl %>% calculate_claim_rate(!! sym(varname))
  
  segdata_plot <- ggplot(segdata_tbl) +
    geom_col(aes(x = !! sym(varname), y = claim_rate, fill = total_exposure)) +
    labs(x = label, y = "Claim Rate", fill = "Total Exposure") +
    expand_limits(y = 0) +
    scale_fill_continuous(labels = label_comma()) +
    ggtitle(glue("Claim Rate by {label}"))

  if(varname %in% c("region", "brand")) {
    segdata_plot <- segdata_plot +
      theme(axis.text.x = element_text(angle = 90))
  }
  
  return_lst <- list(
    data_tbl  = segdata_tbl,
    data_plot = segdata_plot
  )

  return(return_lst)
}


vardata_tbl <- tribble(
                 ~label,         ~varname,
                "Power",          "power",
              "Car Age",        "car_age",
           "Driver Age",     "driver_age",
                "Brand",          "brand",
            "Fuel Type",            "gas",
               "Region",         "region",
     "Discrete Car Age",    "cat_car_age",
  "Discrete Driver Age", "cat_driver_age",
     "Discrete Density",    "cat_density"
)


claimrate_vardata_tbl <- vardata_tbl %>%
  mutate(data = map2(label, varname,
                     construct_claimrate_data_plots,
                     data_tbl = data_tbl))

Having generated the various summary tables and plots, we now want to inspect them. A few of them need some cosmetic tweaking.

for(i in 1:nrow(claimrate_vardata_tbl)) {
  claimrate_vardata_tbl %>% pull(data) %>% .[[i]] %>% .$data_plot %>% print()
}

Our analysis above calculates a single claim rate for each of the data splits, giving us just a single estimate for the claim rate.

This point estimate is useful, but it is better to get an idea of the possible range of values consistent with the data observed.

We take two approaches to this problem: we use the bootstrap to get a distribution of bootstrap estimates for each claim rate, and we use Bayesian analysis to combine prior knowledge of insurance claim rates with the data to produce a posterior distribution of the claim rate.
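For the Bayesian route, a minimal sketch is the conjugate Gamma-Poisson model: with a Gamma(a, b) prior on the claim rate and Poisson claim counts offset by exposure, the posterior is Gamma(a + total claims, b + total exposure). The prior parameters below are illustrative assumptions rather than calibrated values; the totals are the overall figures from the summary earlier.

```r
# Conjugate Gamma-Poisson sketch for the overall claim rate.
# Prior:      rate ~ Gamma(prior_shape, prior_rate)
# Likelihood: claim_count_i ~ Poisson(rate * exposure_i)
# Posterior:  Gamma(prior_shape + sum(claim_count), prior_rate + sum(exposure))

prior_shape <- 1     # illustrative assumption
prior_rate  <- 10    # illustrative assumption

total_claimcount <- 16181    # overall totals computed earlier
total_exposure   <- 231824

post_shape <- prior_shape + total_claimcount
post_rate  <- prior_rate  + total_exposure

post_mean     <- post_shape / post_rate
post_interval <- qgamma(c(0.05, 0.95), shape = post_shape, rate = post_rate)
```

With this much data the posterior mean sits very close to the empirical claim rate of 0.0698, so the prior will mainly matter for thinly populated segments.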

5.1.2 Calculate Bootstrap Claim Rate Estimates

n_bootstraps <- 200

As we will re-use the bootstrap a number of times in this project, we first construct a bootstrap dataset that we can re-use whenever we need it.

bootstrap_index_tbl <- data_tbl %>%
  bootstraps(times = n_bootstraps)

bootstrap_index_tbl %>% glimpse()
## Rows: 200
## Columns: 2
## $ splits <list> [<rsplit[413169 x 152162 x 413169 x 14]>, <rsplit[413169 x 15…
## $ id     <chr> "Bootstrap001", "Bootstrap002", "Bootstrap003", "Bootstrap004"…

To save on storage and memory, full replicas of the dataset are not stored in this data structure, but rather an index to each of the rows; we then use the analysis() function to return the full sample.
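A minimal standalone illustration of this behaviour, using the built-in mtcars data rather than our dataset:

```r
library(rsample)

set.seed(42)

# bootstraps() stores row indices rather than copies of the data.
boot_tbl <- bootstraps(mtcars, times = 2)

# analysis() materialises the full bootstrap sample on demand;
# it always has the same number of rows as the original data.
first_sample <- analysis(boot_tbl$splits[[1]])
```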

Now that we have our bootstrap data to work from, we construct a bootstrap distribution of claim rates when segmented by our variables.

construct_bootdata <- ~ claimrate_vardata_tbl %>%
  mutate(data = map2(label, varname,
                     construct_claimrate_data_plots,
                     data_tbl = .x %>% analysis()))

extract_data <- ~ .x %>% mutate(data_tbl = map(data, "data_tbl"))

bootstrap_claimrate_tbl <- bootstrap_index_tbl %>%
  mutate(boot_data = map(splits, construct_bootdata),
         tmpdata   = map(boot_data, extract_data)) %>%
  select(id, tmpdata) %>%
  unnest(tmpdata) %>%
  select(-data) %>%
  group_nest(label, varname) %>%
  mutate(newdata = map(data, ~ .x %>% unnest(data_tbl))) %>%
  select(label, varname, newdata)


construct_plot <- function(varname, data_tbl) {
  data_plot <- ggplot(data_tbl) +
    geom_boxplot(aes(x = !! sym(varname), y = claim_rate,
                     group = !! sym(varname))) +
    expand_limits(y = 0) +
    labs(x = varname, y = "Claim Rate", fill = "Mean Total Exposure") +
    ggtitle(glue("Bootstrap Estimates of the Claim Rate - {varname}"))
  
  if(varname %in% c("brand", "region")) {
    data_plot <- data_plot + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
  }

  return(data_plot)
}


bootstrap_claimrate_tbl %>%
  mutate(data_plot = map2(varname, newdata, construct_plot)) %>%
  pull(data_plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

This visualisation could mislead, as it suggests we are using the bootstrap on each individual value, whereas we are really sampling the joint distribution across all variables and calculating the joint values of the claim rate. To fix this, we instead plot each joint set of values.

construct_joint_plot <- function(varname, data_tbl) {
  summ_tbl <- data_tbl %>%
    group_by(!! sym(varname)) %>%
    summarise(
      .groups = "drop",
      
      qn = c("p10", "p25", "p50", "p75", "p90"),
      val = quantile(claim_rate, probs = c(0.10, 0.25, 0.5, 0.75, 0.90))
      ) %>%
    pivot_wider(names_from = qn, values_from = val)
  
  data_plot <- ggplot(data_tbl) +
    geom_line(aes(x = !! sym(varname), y = claim_rate, group = id),
              alpha = 0.1, colour = "blue") +
    geom_point(aes(x = !! sym(varname), y = p50), data = summ_tbl) +
    geom_errorbar(aes(x = !! sym(varname), ymin = p10, ymax = p90),
                  data = summ_tbl, width = 0, size = 1) +
    geom_errorbar(aes(x = !! sym(varname), ymin = p25, ymax = p75),
                  data = summ_tbl, width = 0, size = 3, colour = "red") +
    expand_limits(y = 0) +
    labs(x = varname, y = "Claim Rate", fill = "Mean Total Exposure") +
    ggtitle(glue("Bootstrap Estimates of the Claim Rate - {varname}"))
  
  if(varname %in% c("brand", "region")) {
    data_plot <- data_plot + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
  }

  return(data_plot)
}


bootstrap_claimrate_tbl %>%
  mutate(data_plot = map2(varname, newdata, construct_joint_plot)) %>%
  pull(data_plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

5.1.3 Calculate Individual Bootstrap Estimates

We now switch our attention to running the bootstrap estimates on a per-segment basis, resampling each subset of the data rather than the whole dataset.

calc_boot_claim_rate <- function(data_split) {
  claimrate_tbl <- data_split %>%
    analysis() %>%
    summarise(
      .groups = "drop",
      
      claim_rate = sum(claim_count) / sum(exposure)
    )
  
  return(claimrate_tbl)
}

calc_singlevar_bootstrap <- function(data_tbl) {
  boot_tbl <- data_tbl %>%
    bootstraps(times = n_bootstraps) %>%
    mutate(claim_rate_data = map(splits, calc_boot_claim_rate)) %>%
    select(id, claim_rate_data) %>%
    unnest(claim_rate_data)
  
  return(boot_tbl)
}

calculate_single_bootstrap_estimates <- function(varname, data_tbl) {
  boot_tbl <- data_tbl %>%
    group_nest(!! sym(varname)) %>%
    mutate(boot_data = map(data, calc_singlevar_bootstrap)) %>%
    select(-data) %>%
    pivot_longer(!boot_data, names_to = "varname", values_to = "value") %>%
    unnest(boot_data) %>%
    mutate(value = value %>% as.character())
  
  return(boot_tbl)
}

single_boot_tbl <- vardata_tbl %>%
  mutate(
    var_data = map(varname, calculate_single_bootstrap_estimates,
                   data_tbl = data_tbl)
    ) %>%
  select(label, var_data) %>%
  unnest(var_data) %>%
  select(label, varname, value, id, claim_rate)

single_boot_tbl %>% glimpse()
## Rows: 40,800
## Columns: 5
## $ label      <chr> "Power", "Power", "Power", "Power", "Power", "Power", "Pow…
## $ varname    <chr> "power", "power", "power", "power", "power", "power", "pow…
## $ value      <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d"…
## $ id         <chr> "Bootstrap001", "Bootstrap002", "Bootstrap003", "Bootstrap…
## $ claim_rate <dbl> 0.06251592, 0.05946303, 0.06165969, 0.06420425, 0.06182126…

To visualise these, we create a facetted boxplot of the bootstrapped claim rates.

ggplot(single_boot_tbl %>% mutate(value = str_trunc(value, width = 10))) +
  geom_boxplot(aes(x = value, y = claim_rate)) +
  expand_limits(y = 0) +
  facet_wrap(vars(label), scales = "free") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8))

5.2 Explore Density Patterns

It appears that the density variable captures the population density at the policy holder's residence. Accidents tend to happen in areas of higher population density, as this is where the cars are, so it is worth doing a bit of exploration here to see if density correlates with other variables.

5.2.1 Density vs Region

We first look to see how density is distributed across the region values to get an idea of how this works.

region_density_plot <- ggplot(data_tbl) +
  geom_boxplot(aes(x = region, y = density)) +
  xlab("Region") +
  ylab("Density") +
  expand_limits(y = 0) +
  ggtitle("Boxplot of Density by Region") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

region_density_plot + scale_y_log10     (labels = label_comma())

region_density_plot + scale_y_continuous(labels = label_comma())

There appears to be a pattern here so we want to look at histograms (both standard and log-scale) across the regions.

hist_plot <- ggplot(data_tbl) +
  geom_histogram(aes(x = density), bins = 50) +
  xlab("Density") +
  ylab("Frequency") +
  facet_wrap(vars(region), scales = "free") +
  ggtitle("Facet Plot of Histograms")

hist_plot + scale_x_continuous(labels = label_comma())

hist_plot + scale_x_log10     (labels = label_comma())

5.2.2 Density vs Power

The power value for the vehicle is often a risk factor to be accounted for so we also want to see how density distributes across it.

power_density_plot <- ggplot(data_tbl) +
  geom_boxplot(aes(x = power, y = density)) +
  xlab("Vehicle Power") +
  ylab("Density") +
  expand_limits(y = 0) +
  ggtitle("Boxplot of Density by Vehicle Power") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

power_density_plot + scale_y_log10     (labels = label_comma())

power_density_plot + scale_y_continuous(labels = label_comma())

5.3 Distribution of Claims vs Non-Claims Policies

We also want to investigate whether or not we observe a difference between the distribution of variables for policies with or without a claim.

data_tbl <- data_tbl %>% mutate(has_claim = claim_count > 0)

construct_comparison_plot <- function(label, varname) {
  data_plot <- ggplot(data_tbl) +
    geom_bar(aes(x = !! sym(varname))) +
    facet_wrap(vars(has_claim), scales = "free_y", nrow = 2) +
    scale_y_continuous(labels = label_comma()) +
    labs(x = label, y = "Frequency", fill = "Claims") +
    ggtitle(glue("Comparison Plot of {label} by Presence of Claim"))
  
  if(varname %in% c("brand", "region")) {
    data_plot <- data_plot + theme(axis.text.x = element_text(angle = 90))
  }
  
  return(data_plot)
} 


vardata_tbl %>%
  mutate(data_plot = map2(label, varname, construct_comparison_plot)) %>%
  pull(data_plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

Next, we have a quick look at how exposure differs across the claim-free and claimed sets of policies. This may not be useful from a predictive point of view, but it can be relevant in terms of operations and risk management.

ggplot(data_tbl) +
  geom_histogram(aes(x = exposure), bins = 50) +
  facet_wrap(vars(has_claim), scales = "free_y", nrow = 2) +
  scale_y_continuous(labels = label_comma()) +
  ggtitle("Comparison Plot of Exposure by Presence of Claim")

Finally, we want to take a look at density across the two subsets of the data. This may help us check the received wisdom that risk increases for policyholders living in areas with more people.

density_plot <- ggplot(data_tbl) +
  geom_histogram(aes(x = density), bins = 50) +
  facet_wrap(vars(has_claim), scales = "free_y", nrow = 2) +
  scale_y_continuous(labels = label_comma()) +
  labs(x = "Density", y = "Frequency", fill = "Claims") +
  ggtitle("Comparison Plot of Density by Presence of Claim")

(density_plot + scale_x_continuous(labels = label_comma())) %>% plot()

(density_plot + scale_x_log10     (labels = label_comma())) %>% plot()

We do not see a huge difference between these two subsets, though this does not disprove that there is a relationship.
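The visual comparison can be backed with a formal check. A minimal sketch, assuming the `data_tbl` and `has_claim` columns constructed above, is a two-sample Kolmogorov-Smirnov test on density:

```r
# Two-sample Kolmogorov-Smirnov test comparing the density distribution
# of claim-bearing policies against claim-free policies. A small p-value
# indicates the two empirical distributions differ - though with ~400k
# policies even tiny differences will be flagged, so the visual effect
# size matters more than the p-value itself.
ks.test(
  data_tbl %>% filter( has_claim) %>% pull(density),
  data_tbl %>% filter(!has_claim) %>% pull(density)
)
```

Note that `ks.test` will warn about ties here, as density is integer-valued; the test statistic is still a useful rough guide.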

5.4 Geospatial Visualisations

We have some geospatial data on the various regions in France, so we need to load the shapefiles and then add all region-related data to this to create the visualisation plots.

fra_adm_sf <- st_read("geospatial_data/", layer = "FRA_adm1")
## Reading layer `FRA_adm1' from data source `/home/rstudio/insurance-modelling/geospatial_data' using driver `ESRI Shapefile'
## Simple feature collection with 22 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 51.0894
## geographic CRS: WGS 84
fra_adm_sf %>% plot()

Now that we have loaded the data, we normalise the region names (stripping accents and replacing spaces with hyphens) so they match the region values in the policy data.

fra_adm_sf <- fra_adm_sf %>%
  mutate(
    region = NAME_1 %>%
      stri_trans_general("LATIN-ASCII") %>%
      str_replace_all(" ", "-")
    )

fra_adm_sf %>% glimpse()
## Rows: 22
## Columns: 14
## $ ID_0      <dbl> 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79,…
## $ ISO       <chr> "FRA", "FRA", "FRA", "FRA", "FRA", "FRA", "FRA", "FRA", "FR…
## $ NAME_0    <chr> "France", "France", "France", "France", "France", "France",…
## $ ID_1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ NAME_1    <chr> "Alsace", "Aquitaine", "Auvergne", "Île-de-France", "Basse-…
## $ HASC_1    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CCN_1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CCA_1     <chr> "42", "72", "83", "11", "25", "26", "53", "24", "21", "94",…
## $ TYPE_1    <chr> "Région", "Région", "Région", "Région", "Région", "Région",…
## $ ENGTYPE_1 <chr> "Region", "Region", "Region", "Region", "Region", "Region",…
## $ NL_NAME_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ VARNAME_1 <chr> "Alsacia|Elsaß", NA, NA, NA, "Baja Normandía|Lower Normandy…
## $ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((7.115059 49..., MULTIPOLYGON (…
## $ region    <chr> "Alsace", "Aquitaine", "Auvergne", "Ile-de-France", "Basse-…

We first want to see a plot of the claim rate by region on a choropleth map.

plot_sf <- fra_adm_sf %>%
  left_join(data_tbl %>% calculate_claim_rate(region), by = "region")

ggplot(plot_sf) +
  geom_sf(aes(fill = claim_rate)) +
  geom_sf_text(aes(label = region)) +
  labs(fill = "Claim Rate") +
  theme_void()

6 Severity Investigation

Having focused solely on the frequency side of the data, we now turn our attention to the size of the claims.

6.1 Investigate Claim Size

We investigate whether any of the policy variables are predictive of claim size.

claimdata_tbl <- modeldata_tbl %>% unnest(sev_data)

claimdata_tbl %>% glimpse()
## Rows: 16,181
## Columns: 15
## $ policy_id      <chr> "33", "41", "92", "96", "96", "142", "182", "377", "37…
## $ exposure       <dbl> 0.75, 0.14, 0.14, 0.62, 0.62, 0.75, 0.68, 0.50, 0.73, …
## $ power          <fct> g, l, d, j, j, e, d, i, f, f, i, j, i, i, e, e, e, g, …
## $ car_age        <int> 1, 5, 0, 0, 0, 0, 10, 0, 0, 8, 0, 8, 0, 0, 0, 0, 0, 2,…
## $ driver_age     <int> 61, 50, 36, 51, 51, 34, 24, 35, 74, 49, 78, 50, 50, 50…
## $ brand          <fct> "Japanese (except Nissan) or Korean", "Japanese (excep…
## $ gas            <fct> Regular, Diesel, Regular, Regular, Regular, Regular, R…
## $ region         <fct> Ile-de-France, Basse-Normandie, Ile-de-France, Ile-de-…
## $ density        <int> 27000, 56, 4792, 27000, 27000, 1565, 3064, 1139, 3132,…
## $ claim_total    <dbl> 302, 2001, 1449, 10870, 10870, 1390, 1418, 747, 3332, …
## $ claim_count    <int> 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
## $ cat_driver_age <fct> 43-74, 43-74, 27-42, 43-74, 43-74, 27-42, 23-26, 27-42…
## $ cat_car_age    <fct> 0-1, 5-15, 0-1, 0-1, 0-1, 0-1, 5-15, 0-1, 0-1, 5-15, 0…
## $ cat_density    <fct> 4500+, 41-200, 4500+, 4500+, 4500+, 501-4500, 501-4500…
## $ claim_amount   <int> 302, 2001, 1449, 946, 9924, 1390, 1418, 747, 3332, 295…

Having constructed a separate table of claim data, we perform some basic exploration on it.

We start by plotting the claim sizes as a single distribution.

ggplot(claimdata_tbl) +
  geom_histogram(aes(x = claim_amount), bins = 50) +
  labs(x = "Claim Size", y = "Frequency") +
  scale_x_log10(labels = label_comma()) +
  ggtitle("Histogram of Claim Size")

ggplot(claimdata_tbl) +
  geom_histogram(aes(x = claim_amount, y = cumsum(..count..)), bins = 50) +
  scale_x_log10(labels = label_comma()) +
  xlab("Claim Amount") +
  ylab("Cumulative Frequency") +
  ggtitle("Cumulative Histogram of the Claim Size")

facet_vars <- c("power", "brand", "gas", "region", "claim_count")

create_facet_plot <- function(varname, data_tbl) {
  facet_plot <- ggplot(data_tbl) +
    geom_histogram(aes(x = claim_amount), bins = 50) +
    facet_wrap(vars(!! sym(varname)), scales = "free_y") +
    labs(x = "Claim Size", y = "Frequency") +
    scale_x_log10(labels = label_comma()) +
    ggtitle(glue("Histogram of Claim Size Facetted by {varname}"))
  
  facet_cuml_plot <- ggplot(data_tbl) +
    stat_ecdf(aes(x = claim_amount)) +
    facet_wrap(vars(!! sym(varname)), scales = "free_y", shrink = TRUE) +
    scale_x_log10(labels = label_comma()) +
    labs(x = "Claim Size", y = "Cumulative Probability") +
    ggtitle(glue("Cumulative ECDF of Claim Size Facetted by {varname}"))

  facet_ecdf_plot <- ggplot(data_tbl) +
    stat_ecdf(aes(x = claim_amount, colour = !! sym(varname))) +
    scale_x_log10(labels = label_comma()) +
    labs(x = "Claim Size", y = "Cumulative Probability") +
    ggtitle(glue("Cumulative ECDF of Claim Size Coloured by {varname}"))

  
  return(
    list(
      hist_plot = facet_plot,
      cuml_plot = facet_cuml_plot,
      ecdf_plot = facet_ecdf_plot
    )
  )
}


print_plots <- function(x) {
  x$hist_plot %>% print()
  x$cuml_plot %>% print()
  x$ecdf_plot %>% print()
  
  cat("---")
}


facet_vars %>%
  map(create_facet_plot, data_tbl = claimdata_tbl) %>%
  map(print_plots)

## ---

## ---

## ---

## ---

## ---
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL

6.2 Investigate Tail Properties

Severity distributions for liability insurance usually have a heavy right tail: a small number of claims are massive. As a result, we often use heavy-tailed distributions such as the Pareto to model these large losses.

claimdata_tbl %>%
  pull(claim_amount) %>%
  hill(option = "alpha")

claimdata_tbl %>%
  pull(claim_amount) %>%
  hill(option = "xi")
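The `evir::hill` output can be cross-checked with a direct computation. A minimal sketch of the Hill estimator of the tail index follows, where the choice of `k = 500` upper order statistics is an arbitrary illustrative one (the Hill plot above shows the estimate across all values of `k`):

```r
# Hill estimator of the tail index alpha from the k largest claims:
#   xi_hat = mean(log X_(1..k)) - log X_(k+1),  alpha_hat = 1 / xi_hat
# where X_(1) >= X_(2) >= ... are the descending order statistics.
claims_desc <- claimdata_tbl %>% pull(claim_amount) %>% sort(decreasing = TRUE)

k <- 500  # number of upper order statistics; an arbitrary illustrative choice

xi_hat    <- mean(log(claims_desc[1:k])) - log(claims_desc[k + 1])
alpha_hat <- 1 / xi_hat

alpha_hat
```

Smaller values of `alpha_hat` indicate a heavier tail; the estimate is sensitive to the choice of `k`, which is why the full Hill plot is the more honest diagnostic.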

7 R Environment

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       Ubuntu 20.04.1 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-02-04                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.0)
##  backports     1.1.10  2020-09-15 [1] RSPM (R 4.0.2)
##  blob          1.2.1   2020-01-20 [1] RSPM (R 4.0.0)
##  bookdown      0.20    2020-06-23 [1] RSPM (R 4.0.2)
##  broom         0.7.1   2020-10-02 [1] RSPM (R 4.0.2)
##  cellranger    1.1.0   2016-07-27 [1] RSPM (R 4.0.0)
##  class         7.3-17  2020-04-26 [2] CRAN (R 4.0.2)
##  classInt      0.4-3   2020-04-07 [1] RSPM (R 4.0.0)
##  cli           2.1.0   2020-10-12 [1] RSPM (R 4.0.2)
##  codetools     0.2-16  2018-12-24 [2] CRAN (R 4.0.2)
##  colorspace    1.4-1   2019-03-18 [1] RSPM (R 4.0.0)
##  conflicted  * 1.0.4   2019-06-21 [1] RSPM (R 4.0.0)
##  cowplot     * 1.1.0   2020-09-08 [1] RSPM (R 4.0.2)
##  crayon        1.3.4   2017-09-16 [1] RSPM (R 4.0.0)
##  DBI           1.1.0   2019-12-15 [1] RSPM (R 4.0.0)
##  dbplyr        1.4.4   2020-05-27 [1] RSPM (R 4.0.0)
##  digest        0.6.25  2020-02-23 [1] RSPM (R 4.0.0)
##  dplyr       * 1.0.2   2020-08-18 [1] RSPM (R 4.0.2)
##  e1071         1.7-3   2019-11-26 [1] RSPM (R 4.0.0)
##  ellipsis      0.3.1   2020-05-15 [1] RSPM (R 4.0.0)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.0)
##  evir        * 1.7-4   2018-03-20 [1] RSPM (R 4.0.0)
##  fansi         0.4.1   2020-01-08 [1] RSPM (R 4.0.0)
##  farver        2.0.3   2020-01-16 [1] RSPM (R 4.0.0)
##  forcats     * 0.5.0   2020-03-01 [1] RSPM (R 4.0.0)
##  fs          * 1.5.0   2020-07-31 [1] RSPM (R 4.0.2)
##  furrr       * 0.2.0   2020-10-12 [1] RSPM (R 4.0.2)
##  future      * 1.19.1  2020-09-22 [1] RSPM (R 4.0.2)
##  generics      0.0.2   2018-11-29 [1] RSPM (R 4.0.0)
##  ggplot2     * 3.3.2   2020-06-19 [1] RSPM (R 4.0.1)
##  globals       0.13.1  2020-10-11 [1] RSPM (R 4.0.2)
##  glue        * 1.4.2   2020-08-27 [1] RSPM (R 4.0.2)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.0.0)
##  haven         2.3.1   2020-06-01 [1] RSPM (R 4.0.2)
##  hms           0.5.3   2020-01-08 [1] RSPM (R 4.0.0)
##  htmltools     0.5.0   2020-06-16 [1] RSPM (R 4.0.1)
##  httr          1.4.2   2020-07-20 [1] RSPM (R 4.0.2)
##  jsonlite      1.7.1   2020-09-07 [1] RSPM (R 4.0.2)
##  KernSmooth    2.23-17 2020-04-26 [2] CRAN (R 4.0.2)
##  knitr         1.30    2020-09-22 [1] RSPM (R 4.0.2)
##  labeling      0.3     2014-08-23 [1] RSPM (R 4.0.0)
##  lifecycle     0.2.0   2020-03-06 [1] RSPM (R 4.0.0)
##  listenv       0.8.0   2019-12-05 [1] RSPM (R 4.0.0)
##  lubridate   * 1.7.9   2020-06-08 [1] RSPM (R 4.0.2)
##  magrittr    * 1.5     2014-11-22 [1] RSPM (R 4.0.0)
##  memoise       1.1.0   2017-04-21 [1] RSPM (R 4.0.0)
##  modelr        0.1.8   2020-05-19 [1] RSPM (R 4.0.0)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.0)
##  pillar        1.4.6   2020-07-10 [1] RSPM (R 4.0.2)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.0)
##  ps            1.4.0   2020-10-07 [1] RSPM (R 4.0.2)
##  purrr       * 0.3.4   2020-04-17 [1] RSPM (R 4.0.0)
##  R6            2.4.1   2019-11-12 [1] RSPM (R 4.0.0)
##  Rcpp          1.0.5   2020-07-06 [1] RSPM (R 4.0.2)
##  readr       * 1.4.0   2020-10-05 [1] RSPM (R 4.0.2)
##  readxl        1.3.1   2019-03-13 [1] RSPM (R 4.0.2)
##  reprex        0.3.0   2019-05-16 [1] RSPM (R 4.0.0)
##  rlang       * 0.4.8   2020-10-08 [1] RSPM (R 4.0.2)
##  rmarkdown     2.4     2020-09-30 [1] RSPM (R 4.0.2)
##  rmdformats    0.3.7   2020-03-11 [1] RSPM (R 4.0.0)
##  rsample     * 0.0.8   2020-09-23 [1] RSPM (R 4.0.2)
##  rstudioapi    0.11    2020-02-07 [1] RSPM (R 4.0.0)
##  rvest         0.3.6   2020-07-25 [1] RSPM (R 4.0.2)
##  scales      * 1.1.1   2020-05-11 [1] RSPM (R 4.0.0)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.0)
##  sf          * 0.9-6   2020-09-13 [1] RSPM (R 4.0.2)
##  snakecase   * 0.11.0  2019-05-25 [1] RSPM (R 4.0.0)
##  stringi     * 1.5.3   2020-09-09 [1] RSPM (R 4.0.2)
##  stringr     * 1.4.0   2019-02-10 [1] RSPM (R 4.0.0)
##  tibble      * 3.0.4   2020-10-12 [1] RSPM (R 4.0.2)
##  tidyr       * 1.1.2   2020-08-27 [1] RSPM (R 4.0.2)
##  tidyselect    1.1.0   2020-05-11 [1] RSPM (R 4.0.0)
##  tidyverse   * 1.3.0   2019-11-21 [1] RSPM (R 4.0.0)
##  units         0.6-7   2020-06-13 [1] RSPM (R 4.0.2)
##  utf8          1.1.4   2018-05-24 [1] RSPM (R 4.0.0)
##  vctrs       * 0.3.4   2020-08-29 [1] RSPM (R 4.0.2)
##  withr         2.3.0   2020-09-22 [1] RSPM (R 4.0.2)
##  xfun          0.18    2020-09-29 [1] RSPM (R 4.0.2)
##  xml2          1.3.2   2020-04-23 [1] RSPM (R 4.0.0)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.0)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library